Algorithms for Binary Neural Networks
where $\theta$ and $\lambda$ are hyperparameters, $\vec{M} = \{M_1, \ldots, M_N\}$ are the M-Filters, and $\hat{C}$ is the binarized filter set across all layers. The operation $\circ$ defined in Eq. 3.12 approximates unbinarized filters from binarized filters and M-Filters, which leads to the filter loss as the first term on the right of Eq. 3.18. The second term on the right is similar to the center loss used to evaluate intra-class compactness; it deals with the feature variation caused by the binarization process. $f_m(\hat{C}, \vec{M})$ denotes the feature map of the last convolutional layer for the $m$th sample, and $f(\hat{C}, \vec{M})$ denotes the class-specific mean feature map of the previous samples. We note that the center loss has been successfully deployed to handle feature variations. After training, we keep only the binarized filters and the shared M-Filters (which are quite small) to calculate the feature maps, reducing the storage requirement. We then consider the conventional loss and define a new loss function $L_{S,M} = L_S + L_M$, where $L_S$ is a conventional loss function, e.g., the softmax loss.
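To make the structure of $L_M$ concrete, here is a minimal NumPy sketch of its two terms. The shapes, the elementwise stand-in for $\circ$, and the helper names are illustrative assumptions, not the actual MCN implementation:

```python
import numpy as np

def filter_loss(C, C_hat, M, theta):
    # First term of Eq. 3.18: theta/2 * sum ||C - C_hat o M||^2,
    # with the MCN operation "o" simplified here to an elementwise product.
    return 0.5 * theta * np.sum((C - C_hat * M) ** 2)

def compactness_loss(feats, labels, lam):
    # Second term: a center-loss-style penalty pulling each sample's
    # feature map toward the mean feature map of its class.
    loss = 0.0
    for c in np.unique(labels):
        fc = feats[labels == c]
        loss += np.sum((fc - fc.mean(axis=0)) ** 2)
    return 0.5 * lam * loss
```

Both terms vanish exactly when the reconstruction is perfect and all features within a class coincide, which matches the intent of the two penalties.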
Again, we consider the quantization process in our loss $L_{S,M}$ and obtain the final minimization objective:
$$L(C, \hat{C}, M) = L_{S,M} + \frac{\theta}{2}\left\| C^{[k]} - \hat{C} - \eta\,\delta_C^{[k]} \right\|^2, \tag{3.19}$$
where $\theta$ is shared with Eq. 3.18 to reduce the number of parameters, and $\delta_C^{[k]}$ is the gradient of $L_{S,M}$ with respect to $C^{[k]}$. Unlike conventional methods (such as XNOR-Net), where only the filter reconstruction is considered in the weight calculation, our discrete optimization method provides a comprehensive way to calculate binarized CNNs by considering the filter loss, the softmax loss, and feature compactness in a unified framework.
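As a small numerical illustration of the quantization term in Eq. 3.19 (the function name and shapes are arbitrary; the network losses in $L_{S,M}$ are omitted):

```python
import numpy as np

def quantization_term(C_hat, C_k, delta_C_k, theta, eta):
    # theta/2 * || C^[k] - C_hat - eta * delta_C^[k] ||^2  (Eq. 3.19)
    return 0.5 * theta * np.sum((C_k - C_hat - eta * delta_C_k) ** 2)
```

The term is zero when the binarized filters coincide with the gradient-updated unbinarized filters, so minimizing it pulls $\hat{C}$ toward the updated $C$.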
3.4.3 Back-Propagation Updating
In MCNs, the unbinarized filters $C_i$ and the M-Filters $M$ must both be learned and updated. The two types of filters are learned jointly: in each convolutional layer, MCNs sequentially update the unbinarized filters and then the M-Filters.
Updating unbinarized filters: The gradient $\delta_{\hat{C}}$ corresponding to $C_i$ is defined as
$$\delta_{\hat{C}} = \frac{\partial L}{\partial \hat{C}_i} = \frac{\partial L_S}{\partial \hat{C}_i} + \frac{\partial L_M}{\partial \hat{C}_i} + \theta\left(\hat{C}^{[k]} - C^{[k]} - \eta_1\,\delta_C^{[k]}\right), \tag{3.20}$$
$$C_i \leftarrow C_i - \eta_1\,\delta_{\hat{C}}, \tag{3.21}$$
where $L$, $L_S$, and $L_M$ are the loss functions defined above, and $\eta_1$ is the learning rate. Furthermore, we have the following:
$$\frac{\partial L_S}{\partial \hat{C}_i} = \frac{\partial L_S}{\partial Q} \cdot \frac{\partial Q}{\partial \hat{C}_i} = \sum_j \frac{\partial L_S}{\partial Q_{ij}} \cdot M'_j, \tag{3.22}$$
$$\frac{\partial L_M}{\partial \hat{C}_i} = \theta \sum_j \left(C_i - \hat{C}_i \circ M_j\right) \circ M_j. \tag{3.23}$$
Updating M-Filters: We further update the M-Filter $M$ with $C$ fixed. $\delta_M$ is defined as the gradient with respect to $M$, and we have:
$$\delta_M = \frac{\partial L}{\partial M} = \frac{\partial L_S}{\partial M} + \frac{\partial L_M}{\partial M}, \tag{3.24}$$
$$M \leftarrow \left|M - \eta_2\,\delta_M\right|, \tag{3.25}$$